Deep Learning Course - Assignment 1

Dog Breed Identification

Submitted by: Itay Bouganim, Ido Rom and Shauli Genish

Problem Statement

We are provided with a training set and a test set of images of dogs. Each image has a filename that is its unique id. The dataset comprises 120 breeds of dogs. The goal is to create a classifier capable of determining a dog's breed from a photo. The list of breeds is as follows

Task Description

In this task, we were provided a strictly canine subset of ImageNet in order to practice fine-grained image categorization. How well can we tell our Norfolk Terriers from our Norwich Terriers? With 120 breeds of dogs and a limited number of training images per class, we might find the problem more, err, ruff than we anticipated.

Kaggle competition link - containing the dataset and labels

Check for existing physical GPU
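A minimal sketch of such a check, assuming the notebook uses TensorFlow (the source does not name the framework). The helper name `list_gpus` is hypothetical; `tf.config.list_physical_devices` is the TensorFlow 2 call for listing devices.

```python
def list_gpus():
    """Return the names of the GPUs visible to TensorFlow, or [] if none."""
    try:
        import tensorflow as tf  # assumption: TensorFlow 2.x is the framework in use
    except ImportError:
        return []  # TensorFlow is not installed, so no GPU can be reported
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = list_gpus()
if gpus:
    print(f"Found {len(gpus)} GPU(s): {gpus}")
else:
    print("No GPU found, training would run on the CPU")
```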

1. Preprocessing The Data

1.a. Check our dataset size for the train and test data
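One way to count the samples is to count the image files in each folder. The sketch below assumes the images are `.jpg` files stored in `data/train` and `data/test`; these paths and the helper name `count_images` are illustrative, not taken from the original notebook.

```python
from pathlib import Path


def count_images(directory, extension=".jpg"):
    """Count the files with the given extension directly inside `directory`."""
    return sum(
        1
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == extension
    )


# Hypothetical locations; adjust to wherever the Kaggle archive was extracted.
for split in ("train", "test"):
    folder = Path("data") / split
    if folder.is_dir():
        print(f"{split}: {count_images(folder)} images")
    else:
        print(f"{split}: folder {folder} not found")
```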

1.b. Read labels and assign filenames

The dataset does not contain labels for the test data. We will read the attached CSV file, which contains labels only for the train samples.
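Reading the labels and deriving each filename from its id could look roughly like this. The column names `id` and `breed`, the `.jpg` extension, and the paths are assumptions about the CSV layout; `read_labels` is a hypothetical helper.

```python
import csv
from pathlib import Path


def read_labels(csv_path, image_dir, id_col="id", label_col="breed"):
    """Return a list of (image_path, label) pairs read from the labels file."""
    samples = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            # The filename is the unique image id plus an extension (assumed .jpg).
            image_path = Path(image_dir) / f"{row[id_col]}.jpg"
            samples.append((image_path, row[label_col]))
    return samples


labels_file = Path("data/labels.csv")  # hypothetical path
if labels_file.exists():
    samples = read_labels(labels_file, "data/train")
    print(f"{len(samples)} labeled samples, "
          f"{len({label for _, label in samples})} distinct labels")
```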

We can see that we indeed have 120 labels in the dataset.
The test data is not labeled, so in order to measure our progress we will submit our results to Kaggle, and we will also hold out part of the labeled data as a validation set that our model does not train on, to help us visualize our progress.
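Holding out a validation set can be sketched as below. The split is done per label so that every breed appears in both parts; the stratification, the 20% fraction, and the fixed seed are assumptions for illustration, not choices confirmed by the source.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split (item, label) pairs into train and validation lists, label by label."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Keep at least one validation sample per label when the label has
        # more than one sample; a single-sample label stays in training.
        n_val = max(1, round(len(group) * val_fraction)) if len(group) > 1 else 0
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


toy = [(i, "beagle") for i in range(10)] + [(i, "pug") for i in range(10, 15)]
train, val = stratified_split(toy)
print(len(train), "train samples,", len(val), "validation samples")
```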

1.c. Look at the sample count for each label

Explore the sample count from each label

Most common labels

Least common labels
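The per-label counts, and the most and least common labels, can be obtained with a sketch like the following. The toy label list stands in for the breed column of the labels file, and `most_and_least_common` is a hypothetical helper.

```python
from collections import Counter


def most_and_least_common(labels, n=5):
    """Return (most_common, least_common) lists of (label, count) pairs."""
    counts = Counter(labels)
    most = counts.most_common(n)
    # Sort ascending by count to get the rarest labels first.
    least = sorted(counts.items(), key=lambda pair: pair[1])[:n]
    return most, least


# Toy data for illustration only.
toy_labels = ["beagle"] * 3 + ["pug"] * 2 + ["samoyed"]
most, least = most_and_least_common(toy_labels, n=1)
print("Most common:", most)
print("Least common:", least)
```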

Plot the label distribution histogram

Uneven data problem

Since the number of samples per label is uneven, we can expect the learning process to be harder for some labels than for others.

We need to compensate for this by weighting our labels inversely to their frequency in the dataset, to avoid biasing the model toward the more common breeds.
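One common way to derive such weights is the "balanced" heuristic, where the weight of a label is `n_samples / (n_classes * count)`. This is an assumed formula for illustration; the source does not state which weighting was used. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each label to n_samples / (n_classes * count_of_label)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy data: the rarer label receives the larger weight.
weights = balanced_class_weights(["beagle", "beagle", "beagle", "pug"])
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights, a label with the average number of samples gets a weight of 1, rarer labels get more, and common labels get less. Frameworks that accept per-class weights usually expect integer class indices as keys, so the label names would first need to be mapped to indices.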

Further data analysis

We can see that the number of train samples per label varies.
We want to better understand the amount and dimensions of our image data.

Plot the distribution of the per-class image count and of the image widths and heights

Uneven data dimensions

We can see that our image dimensions vary, so we will need to resize the samples to a common size before passing them to our model.
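To illustrate what resizing to a common shape means, here is a minimal nearest-neighbour sketch on NumPy arrays. In practice an image library's resize routine would normally be used; the 224x224 target and the function name `resize_nearest` are assumptions, not values taken from the source.

```python
import numpy as np


def resize_nearest(image, target_height, target_width):
    """Resize an (H, W, C) array to (target_height, target_width, C)
    by picking the nearest source pixel for each target pixel."""
    height, width = image.shape[:2]
    # Source row and column index for every target row and column.
    rows = np.arange(target_height) * height // target_height
    cols = np.arange(target_width) * width // target_width
    return image[rows[:, None], cols]


# Two fake images with different dimensions end up with the same shape.
for shape in [(375, 500, 3), (600, 400, 3)]:
    fake_image = np.zeros(shape, dtype=np.uint8)
    print(shape, "->", resize_nearest(fake_image, 224, 224).shape)
```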

1.d. Similar Datasets Benchmarks

The following articles are strongly related to the problem that we are trying to solve here.
Although they do not use the exact same dog breed dataset, they use a dog breed dataset with a label count similar to ours, and they mostly use transfer learning techniques.
Dog Breed Identification - Stanford University
Dog Breed Identification - University of Waterloo
Dog Breed Identification - ResearchGate

The following tables and charts compare the results obtained using different well-known CNN models on a similar dog breed identification problem.
We can see the metrics for the train and validation data, with and without augmentation.

1.e. Plot samples from our dog breed image samples

From a brief examination of the data we can see that it varies

All of these factors, together with the fact that this is a relatively small dataset for 120 labels, can make the training process harder and need to be taken into account.

Distinguishable samples

African Hunting Dog
Shih-Tzu

Hard to distinguish samples

Samoyed
Great Pyrenees

Dog Focus differences

Mainly Face
Mainly Scenery

Showing not only dogs

Obstructed by noticeable items
